Vipava
Extracting domain-specific terms using contextual word embeddings
Repar, Andraž, Lavrač, Nada, Pollak, Senja
Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus RSDO5 for the Slovenian language and devise a set of rules for term candidate selection and then generate statistical, linguistic and context-based features. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state-of-the-art, which proves that contextual word embeddings are valuable for improving term extraction.
1. Introduction
Automated terminology extraction (ATE) refers to the task of extracting meaningful terms from domain-specific texts. Terms are single-word (SWU) or multi-word units (MWU) of knowledge, which are relevant for a particular domain. Since manual identification of terms is costly and time consuming, ATE approaches can reduce the effort needed to generate relevant domain-specific terms. Recognizing and extracting domain-specific terms, which is useful in various fields, such as translation, dictionary creation, ontology generation and others, remains a difficult task.
- Europe > Slovenia > Gorizia > Municipality of Vipava > Vipava (0.04)
- Europe > Slovenia > Gorizia > Municipality of Nova Gorica > Nova Gorica (0.04)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- Asia > Malaysia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
AHAM: Adapt, Help, Ask, Model -- Harvesting LLMs for literature mining
Koloski, Boshko, Lavrač, Nada, Cestnik, Bojan, Pollak, Senja, Škrlj, Blaž, Kastrin, Andrej
In an era marked by a rapid increase in scientific publications, researchers grapple with the challenge of keeping pace with field-specific advances. We present the 'AHAM' methodology and a metric that guides the domain-specific adaptation of the BERTopic topic modeling framework to improve scientific text analysis. By utilizing the LLaMa2 generative language model, we generate topic definitions via one-shot learning by crafting prompts with the help of domain experts to guide the LLM for literature mining by asking it to model the topic names. For inter-topic similarity evaluation, we leverage metrics from language generation and translation processes to assess lexical and semantic similarity of the generated topics. Our system aims to reduce both the ratio of outlier topics to the total number of topics and the similarity between topic definitions. The methodology has been assessed on a newly gathered corpus of scientific papers on literature-based discovery. Through rigorous evaluation by domain experts, AHAM has been validated as effective in uncovering intriguing and novel insights within broad research areas. We explore the impact of domain adaptation of sentence-transformers for the task of topic modeling using two datasets, each specialized to specific scientific domains within arXiv and medRxiv. We evaluate the impact of data size, the niche of adaptation, and the importance of domain adaptation. Our results suggest a strong interaction between domain adaptation and topic modeling precision in terms of outliers and topic definitions.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.05)
- North America > United States > Washington > King County > Seattle (0.04)
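The two quantities the AHAM abstract aims to reduce (the share of outlier topics and the similarity between topic definitions) can be sketched as follows. This is a minimal illustration, not the authors' metric: the `is_outlier` predicate and the token-level Jaccard similarity are assumptions standing in for the outlier criterion and the language-generation similarity metrics the abstract mentions but does not specify.

```python
from itertools import combinations


def outlier_topic_ratio(topic_names, is_outlier):
    """Share of topics flagged as outliers (is_outlier is a hypothetical predicate)."""
    if not topic_names:
        return 0.0
    return sum(1 for name in topic_names if is_outlier(name)) / len(topic_names)


def jaccard(a, b):
    """Token-overlap similarity; an assumed stand-in for the paper's similarity metrics."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def mean_pairwise_similarity(definitions, sim=jaccard):
    """Average similarity over all unordered pairs of topic definitions."""
    pairs = list(combinations(definitions, 2))
    if not pairs:
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


topics = ["gene expression analysis", "gene expression networks", "outlier"]
ratio = outlier_topic_ratio(topics, lambda name: name == "outlier")
similarity = mean_pairwise_similarity(topics)
```

Under these assumptions, lower values of both `ratio` and `similarity` would indicate fewer outlier topics and more distinct topic definitions.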
Learning Channel Importance for High Content Imaging with Interpretable Deep Input Channel Mixing
Siegismund, Daniel, Wieser, Mario, Heyse, Stephan, Steigele, Stephan
Uncovering novel drug candidates for treating complex diseases remains one of the most challenging tasks in early discovery research. To tackle this challenge, biopharma research established a standardized high content imaging protocol that tags different cellular compartments per image channel. In order to judge the experimental outcome, the scientist requires knowledge about the channel importance with respect to a certain phenotype for decoding the underlying biology. In contrast to traditional image analysis approaches, such experiments are nowadays preferably analyzed by deep-learning-based approaches which, however, lack crucial information about the channel importance. To overcome this limitation, we present a novel approach which utilizes multi-spectral information of high content images to interpret a certain aspect of cellular biology. To this end, we base our method on image blending concepts with alpha compositing for an arbitrary number of channels. More specifically, we introduce DCMIX, a lightweight, scalable and end-to-end trainable mixing layer which enables interpretable predictions in high content imaging while retaining the benefits of deep-learning-based methods. We conduct an extensive set of experiments on both MNIST and RxRx1 datasets, demonstrating that DCMIX learns the biologically relevant channel importance without sacrificing prediction performance.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > Slovenia > Gorizia > Municipality of Vipava > Vipava (0.04)
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
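The idea of blending an arbitrary number of image channels with interpretable weights can be illustrated with a small sketch. It is not the DCMIX layer itself: the softmax parameterization of the blending weights and the function name `mix_channels` are assumptions, since the abstract states only that the method builds on alpha compositing. In a trainable layer the logits would be learned by backpropagation; here they are passed in by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax so the channel weights are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_channels(image, logits):
    """Blend channels into one image as a weighted sum.

    image:  list of channels, each a list of rows of floats (same shape per channel)
    logits: one unnormalized weight per channel (assumed parameterization)
    Returns the mixed single-channel image and the normalized weights.
    """
    if len(image) != len(logits):
        raise ValueError("need exactly one logit per channel")
    alphas = softmax(logits)
    height, width = len(image[0]), len(image[0][0])
    mixed = [
        [sum(a * channel[r][c] for a, channel in zip(alphas, image)) for c in range(width)]
        for r in range(height)
    ]
    return mixed, alphas


# Two 1x2 channels with equal logits are simply averaged.
image = [[[1.0, 2.0]], [[3.0, 4.0]]]
mixed, alphas = mix_channels(image, [0.0, 0.0])
```

The normalized weights `alphas` are what one would read as per-channel importance in this simplified setting.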
Link Analysis meets Ontologies: Are Embeddings the Answer?
Mežnar, Sebastian, Bevec, Matej, Lavrač, Nada, Škrlj, Blaž
The increasing amounts of semantic resources offer valuable storage of human knowledge; however, the probability of wrong entries increases with the increased size. The development of approaches that identify potentially spurious parts of a given knowledge base is thus becoming an increasingly important area of interest. In this work, we present a systematic evaluation of whether structure-only link analysis methods can already offer a scalable means of detecting possible anomalies, as well as potentially interesting novel relation candidates. Evaluating thirteen methods on eight different semantic resources, including Gene Ontology, Food Ontology, Marine Ontology and similar, we demonstrated that structure-only link analysis could offer scalable anomaly detection for a subset of the data sets. Further, we demonstrated that by considering symbolic node embedding, explanations of the predictions (links) could be obtained, making this branch of methods potentially more valuable than the black-box-only ones. To our knowledge, this is currently one of the most extensive systematic studies of the applicability of different types of link analysis methods across semantic resources from different domains.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- Research Report (0.82)
- Overview (0.66)
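A structure-only link score of the kind the abstract evaluates can be sketched with the Jaccard coefficient of neighbour sets. The choice of Jaccard is an assumption made for illustration; the abstract does not name the thirteen methods. Existing links with low scores are treated as possible anomalies, and unlinked pairs with high scores as candidate relations.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard_score(adj, u, v):
    """Neighbour overlap of u and v, ignoring the (possible) link between them."""
    nu, nv = adj[u] - {v}, adj[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def rank_links(edges):
    """Return (existing links, lowest score first) and (missing links, highest first)."""
    adj = build_adjacency(edges)
    existing = {frozenset(edge) for edge in edges}
    suspicious = sorted((jaccard_score(adj, u, v), u, v) for u, v in edges)
    candidates = sorted(
        (
            (jaccard_score(adj, u, v), u, v)
            for u, v in combinations(sorted(adj), 2)
            if frozenset((u, v)) not in existing
        ),
        reverse=True,
    )
    return suspicious, candidates


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d"), ("d", "e")]
suspicious, candidates = rank_links(edges)
```

A low score is only a heuristic signal: a link to a leaf node such as `e` scores zero simply because the leaf has no other neighbours, so flagged links would still need expert review.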
Transfer Learning for Node Regression Applied to Spreading Prediction
Mežnar, Sebastian, Lavrač, Nada, Škrlj, Blaž
Understanding how information propagates in real-life complex networks yields a better understanding of dynamic processes such as misinformation or epidemic spreading. The recently introduced branch of machine learning methods for learning node representations offers many novel applications, one of them being the task of spreading prediction addressed in this paper. We explore the utility of state-of-the-art node representation learners when used to assess the effects of spreading from a given node, estimated via extensive simulations. Further, as many real-life networks are topologically similar, we systematically investigate whether the learned models generalize to previously unseen networks, showing that in some cases very good model transfer can be obtained. This work is one of the first to explore transferability of the learned representations for the task of node regression; we show there exist pairs of networks with similar structure between which the trained models can be transferred (zero-shot), and demonstrate their competitive performance. To our knowledge, this is one of the first attempts to evaluate the utility of zero-shot transfer for the task of node regression.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report (0.64)
- Overview (0.48)
- Health & Medicine (0.68)
- Information Technology (0.46)
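The per-node regression targets described in the abstract (the effect of spreading from a given node, estimated via simulations) could be produced along the following lines. The spreading model is an assumption: the abstract does not say which one the authors use, so this sketch applies a simple independent-cascade process in which each newly infected node gets one chance to infect each neighbour.

```python
import random


def simulate_spread(adj, seed_node, infect_prob, rng):
    """One cascade from seed_node; returns the number of nodes ever infected."""
    infected = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in adj.get(node, ()):
                if neighbour not in infected and rng.random() < infect_prob:
                    infected.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return len(infected)


def spreading_targets(adj, infect_prob=0.3, runs=200, seed=0):
    """Average outbreak size per seed node, usable as a node regression target."""
    rng = random.Random(seed)  # fixed seed keeps the estimates reproducible
    return {
        node: sum(simulate_spread(adj, node, infect_prob, rng) for _ in range(runs)) / runs
        for node in adj
    }


# Path 0-1-2 plus an isolated node 3.
adj = {0: [1], 1: [0, 2], 2: [1], 3: []}
targets = spreading_targets(adj, infect_prob=0.3)
```

A node regression model would then be trained to predict these averages from node representations, and zero-shot transfer would mean applying that trained model to the nodes of a different network.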
Propositionalization and Embeddings: Two Sides of the Same Coin
Lavrač, Nada, Škrlj, Blaž, Robnik-Šikonja, Marko
Data preprocessing is an important component of machine learning pipelines, which requires ample time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper outlines some of the modern data processing techniques used in relational learning that enable data fusion from different input data types and formats into a single table data representation, focusing on the propositionalization and embedding data transformation approaches. While both approaches aim at transforming data into tabular data format, they use different terminology and task definitions, are perceived to address different goals, and are used in different contexts. This paper contributes a unifying framework that allows for improved understanding of these two data transformation techniques by presenting their unified definitions, and by explaining the similarities and differences between the two approaches as variants of a unified complex data transformation task. In addition to the unifying framework, the novelty of this paper is a unifying methodology combining propositionalization and embeddings, which benefits from the advantages of both in solving complex data transformation and learning tasks. We present two efficient implementations of the unifying methodology: an instance-based PropDRM approach, and a feature-based PropStar approach to data transformation and learning, together with their empirical evaluation on several relational problems. The results show that the new algorithms can outperform existing relational learners and can solve much larger problems.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
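The transformation of relational data into a single table, which the abstract calls propositionalization, can be shown with a toy sketch. It is not PropDRM or PropStar: the function name, the `column=value` feature naming and the binary encoding are assumptions chosen to keep the example small.

```python
def propositionalize(main_rows, related_rows, key):
    """Flatten a one-to-many relation into one binary feature row per main instance.

    main_rows:    list of dicts, one per instance, each containing `key`
    related_rows: list of dicts linked to instances through `key`
    Returns the sorted feature names and a dict: instance id -> 0/1 feature vector.
    """
    per_instance = {row[key]: set() for row in main_rows}
    for row in related_rows:
        if row[key] in per_instance:
            for column, value in row.items():
                if column != key:
                    per_instance[row[key]].add(f"{column}={value}")
    features = sorted(set().union(*per_instance.values()))
    table = {
        instance: [1 if feature in present else 0 for feature in features]
        for instance, present in per_instance.items()
    }
    return features, table


customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 1, "item": "book"},
    {"id": 1, "item": "pen"},
    {"id": 2, "item": "pen"},
]
features, table = propositionalize(customers, orders, key="id")
```

The resulting sparse binary table is the kind of input from which, in the unifying view described in the abstract, a dense embedding could subsequently be learned.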